Using latent semantic analysis and the predication algorithm to improve extraction of meanings from a diagnostic corpus.

Authors

  • Guillermo Jorge-Botana
  • Ricardo Olmos
  • José Antonio León
Abstract

There is currently widespread interest in indexing and extracting taxonomic information from large text collections. An example is the automatic categorization of informally written medical or psychological diagnoses, followed by the extraction of epidemiological information, or even of the terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector space models have been used successfully to this end (Lee, Cimino, Zhu, Sable, Shanker, Ely & Yu, 2006; Pakhomov, Buntrock & Chute, 2006). In this study we apply a computational model known as Latent Semantic Analysis (LSA) to a diagnostic corpus, with the aim of retrieving definitions (in the form of lists of semantic neighbors) of common structures it contains (e.g. "storm phobia", "dog phobia") or of less common structures that might be formed by logical combinations of categories and diagnostic symptoms (e.g. "gun personality" or "germ personality"). Various problems commonly arise when recovering content with vector space models, and they make it hard to bring definitions into line with the meaning of structures and to make them representative. We propose some approaches that bypass these problems, such as Kintsch's (2001) predication algorithm and some corrections to the way lists of neighbors are obtained, which have already been tested on semantic spaces in a non-specific domain (Jorge-Botana, León, Olmos & Hassan-Montero, under review). The results support the idea that the predication algorithm may also be useful for extracting more precise meanings of certain structures from scientific corpora, and that the introduction of some corrections based on vector length may increase its efficiency on non-representative terms.
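
The pipeline the abstract describes (an LSA space, then predication over a structure such as "storm phobia") can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy term-by-document counts, the term list, the function names, and the values of `m` and `k`. The predication step follows the commonly described form of Kintsch's (2001) algorithm (take the `m` nearest neighbors of the predicate, keep the `k` most related to the argument, and sum them with the predicate and argument vectors). The authors' vector-length corrections are not shown, since the abstract does not specify them.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(u @ v / (nu * nv))

def lsa_space(counts, dims):
    """Reduce a term-by-document count matrix with a truncated SVD.

    Each row of the result is the vector of one term.
    """
    U, s, _ = np.linalg.svd(counts, full_matrices=False)
    return U[:, :dims] * s[:dims]

def predication(terms, vectors, predicate, argument, m=3, k=2):
    """Sketch of predication: contextualize `predicate` by `argument`."""
    idx = {t: i for i, t in enumerate(terms)}
    p, a = vectors[idx[predicate]], vectors[idx[argument]]
    others = [t for t in terms if t not in (predicate, argument)]
    # Step 1: the m terms closest to the predicate.
    near_p = sorted(others, key=lambda t: cosine(vectors[idx[t]], p),
                    reverse=True)[:m]
    # Step 2: of those, the k terms closest to the argument.
    chosen = sorted(near_p, key=lambda t: cosine(vectors[idx[t]], a),
                    reverse=True)[:k]
    # Step 3: the structure's vector is the sum of all selected vectors.
    return p + a + sum(vectors[idx[t]] for t in chosen), chosen

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["phobia", "storm", "dog", "fear", "thunder", "bite"]
counts = np.array([
    [2, 2, 1, 0],   # phobia
    [2, 0, 0, 1],   # storm
    [0, 2, 1, 0],   # dog
    [1, 1, 1, 0],   # fear
    [1, 0, 0, 2],   # thunder
    [0, 1, 2, 0],   # bite
], dtype=float)

space = lsa_space(counts, dims=2)
vec, chosen = predication(terms, space, "phobia", "storm", m=3, k=2)

# Semantic neighbors of the combined structure, most similar first.
neighbors = sorted(terms, key=lambda t: cosine(space[terms.index(t)], vec),
                   reverse=True)
print(chosen, neighbors)
```

The list `neighbors` plays the role of the "definition" in the abstract's sense: a ranked list of terms closest to the vector of the whole structure rather than to either word alone.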

Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Verbs in Applied Linguistics Research Article Introductions: Semantic and syntactic analysis

This study aims to investigate the semantic and syntactic features of verbs used in the introduction section of Applied Linguistics research articles published in Iranian and international journals. A corpus of 20 research article introductions (10 from each journal) was used. The corpus was analysed for the syntactic features (tense, aspect and voice) and semantic meaning of verbs. The finding...

Visualizing polysemy using LSA and the predication algorithm

Context is a determining factor in language, and plays a decisive role in polysemic words. Several psycholinguistically-motivated algorithms have been proposed to emulate human management of context, under the assumption that the value of a word is evanescent and takes on meaning only in interaction with other structures. The predication algorithm (Kintsch, 2001), for example, uses a vector rep...

Different Senses Visualization with Latent Semantic Analysis

Some psycholinguistically based algorithms have been proposed to emulate the way humans manage context, on the assumption that the value of a word is evanescent and takes on sense only when it interacts with other structures. For example, the predication algorithm (Kintsch, 2001) uses the vector representations of words produced by LSA (Latent Semantic Analysis) to dynamically simulate the compr...

Journal title:
  • The Spanish journal of psychology

Volume 12, issue 2

Pages: -

Publication date: 2009